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Given two sequences over a finite alphabet C, the D2 statistic 
is the number of m-letter word matches between the two sequences. 
This statistic is used in bioinformatics for expressed sequence tag 
database searches. Here we study a generahzation of the D2 statistic 
in the context of DNA sequences, under the assumption of strand 
symmetric BernouUi text. For fc < m, we look at the count of m-letter 
word matches with up to k mismatches. For this statistic, we compute 
the expectation, give upper and lower bounds for the variance and 
prove its distribution is asymptotically normal. 

1. Introduction. Methods for alignment-free sequence comparison are 
among the more recent tools being developed for sequence analysis in biol- 
ogy [16]. A disadvantage in the classical Smith- Waterman local alignment 
algorithm [13], as well as the popular search algorithms such as FASTA and 
BLAST, is that they assume conservation of contiguity between homologous 
segments. In particular, they overlook the occurrence of genetic shuffling 
[18]. Alignment-free sequence comparison methods are used to compensate 
for this problem. 

A natural alignment-free comparison of two sequences is the number of 
m-letter word matches between the sequences. This statistic, called D2, 
can be computed in linear time in the length of the sequences, which is 
also an advantage over the nonlinear local alignment algorithms. D2 is used 
extensively for EST sequence database searches (e.g., [2, 3, 11] and in the 
software package STACK [6]). 

In [10], Lippert, Huang and Waterman started a rigorous study of D2 
using the model of independent letters in DNA sequences. A formula for 
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the expectation was computed as well as upper and lower bounds for the 
variance. Limiting distributions, as the length of the sequences, n, and the 
size of the word, m, get large, were derived in some cases. The authors used 
Stein-Chen methods [5, 9, 14] to obtain the following results. When the 
underlying distribution of the alphabet is nonuniform, the distribution of 
D2 has normal asymptotic behavior when m/log^n < 1/6. The logarithmic 
base b is defined by 6 = {J2a<^c Ca)~^) where is the probability of a letter 
taking the value a. Following simulations, it was noted in [10] that the bound 
1/6 above is too small. Our simulations in Section 6 suggest that the bound 
should be closer to 2. 

Another asymptotic regime was identified in [10] when m/log^n > 2. In 
this case, the distribution of D2 has compound Poisson asymptotic behav- 
ior. However, as pointed out in [17] and [1], the Poisson approximation is 
meaningful in this region only when E{D2) is not too small. To control this 
degenerate case, one needs to add the linear restriction m = 21og^n + C. 

When the underlying distribution of the alphabet is uniform, it was proved 
in [9] that for m = alogf^n + C with < a < 2, the distribution of D2 is also 
asymptotically normal. 

A natural generalization of the D2 statistic is to count the number of 

(k) 

approximate m-word matches. For k <m, let D2 be the number of m-word 
matches with up to k mismatches between the two sequences. This statistic 
can be expressed in terms of a distance function. One can define the distance 
between two m-words to be the number of mismatches. A /c-neighborhood of 
an m-word w is then all m-words that are at most k distance from w. The 

(k) 

D2 statistic is the number of A;-neighborhood matches of m-words between 
two sequences. 

In [12], Melko and Mushegian studied the A:-distance and fc-neighborhood 
match count between a probe of length m and a random DNA sequence, 
under the assumption that the sequence is strand-symmetric Bernoulli text. 
They gave a formula for the expectation of the /c-distance match count 
and the /c-neighborhood match count. Melko and Mushegian suggested that 
methods of Lippert, Huang and Waterman in [10] could be used to obtain 

(k) 

upper and lower bounds for the variance of D2 and to analyze its asymp- 
totic behavior. 

(k) 

In this paper we study the D2 statistic under the strand-symmetric 
Bernoulli text assumption. We extend the method of [10] to give upper and 
lower bounds for the variance. We analyze the asymptotic behavior of the 
distribution of 1^2'^^ as n and m increase using the method of cumulants [8] 
rather than Stein's method. For D2, the A; = case, this method improves 
the bound on m/log^n obtained in [10] from 1/6 to 1/2. 

The organization of this paper is as follows. In Section 2 we review def- 
initions and introduce notation. In Section 3 we discuss the mean of 
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Section 4 is devoted to the variance of -Dg ■ Section 5 we prove normal 

(k) 

asymptotic behavior of the distribution of D2 ■ Section 6 contains the results 
of numerical simulations, and a concluding summary is given in Section 7. 
A list of notations is provided at the end of Section 7. 

2. Preliminaries. Let C = {A, G, C, T} with strand-symmetric probabil- 
ity measure ^ = {(,a, £,Gi (,Ci Ct} and perturbation parameter r/. That is, — 1 < 
rj <1 IS the unique number satisfying 



Let A = A1A2 ■ ■ ■ An and B = B1B2 - ■ - be two random sequences of 
length n over the alphabet C. We assume that A and B are Bernoulli 
texts, meaning, the letters (nucleotides) are independent and identically dis- 
tributed (i.i.d.). We note that the assumption of both sequences having the 
same length is not essential for what follows and its main purpose is to 
simplify notation. Our results can be easily adapted to the case when the 
sequences are of different lengths. 

Definition 2.1. Let x and y be two words of length m. We define the 
distance between x and y to be 

5(x, y) = number of character mismatches between x and y. 

For k < m, we say that x is a k-distance match of y if 6{x,y) = k. When 
^(x, y) < k, then x is said to be a k-neighbor of y. 

Following the terminology and notation of [12], we have the following 
definition. 

Definition 2.2. In the above setup, define the perturbed binomial dis- 
tribution with perturbation parameter rj by 

9k{m, 7], c) = h{m, rj, c)uk{m, rj, c), 

where < c,k < m are integers and 



eT = i(i+r?), 



h{m, rj, c 



m—c 



Uk{m,r],c) 



m—k / \ / 
i=0 ^ ^ ^ 



m — k — i 



m — c 



) 



Vk{i,r],c) 
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For an m-word w with GC-count Cw, h{m, rj, Cw) is Pr(w), the probabihty of 
seeing w. In the definition of Uk{m, rj, c), and in similar situations throughout 
the paper, we fohow a general convention that (^) = if a < or a > n. 

Note that when = 0, the perturbed binomial distribution is the binomial 
distribution with gk{m,0,c) = bk{m, 1/4), where 

As observed in [12], if T is a strand-symmetric Bernoulli text of length 
m and q is a (known) query text (=word) of length m, then the probability 
distribution of the distance (5(T,q) is a perturbed binomial distribution: 

Pr(5(T, q) = fc) = c/fc(m, r], c), 

where c is the GC-count in q and rj is the perturbation parameter of T. Let 

k 

Gk{m,r],c) =J29r{m,r],c) =Pr((5(T,q) <k) 

r=0 

be the cumulative distribution function of the distance. 

2.1. k -neighborhood matches. Let A and B be two DNA sequences of 
length n. Assume the sequences are strand-symmetric Bernoulli text with 
perturbation parameter rj. Let 0<k<m<nhe integers. 

Definition 2.3. Define the statistic 0^2^^ = D{k,m,n) to be the num- 
ber of fc-neighborhood m-word matches between the sequences A and B, 
including overlaps. Note that D2^^ is the D2 statistic of [10]. 

(k) 

The Df' statistic may be computed as follows. 

Notation. For 1 < s <t <n, write A[s, t] for the subsequence AgAg-^i . . . 
At. 

(k) 

Definition 2.4. Let Y-^ be the ^-neighborhood match indicator (start- 
ing) at position (position i in sequence A and j in B). That is, 

y{k) ^ r 1, if d{A[i, i + m- l],B[j, j + m-l])<k, 
\ 0, otherwise. 

Then the d!^'' statistic can be computed via 

(*,i)6/ 
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where the index set / is 

/ = GNxN:l<i<n-m + l, and l<j<n-m + l}. 

For convenience, we write n for n — m + 1. 
(k) 

3. The mean of • In this section we give a general formula for the 

(k) 

mean of in terms of the perturbed binomial distribution and obtain 

estimates for it. The estimates will be used in later sections in order to 

(k) 

prove normal asymptotic behavior of D2 ■ 

(k) 

First we compute the mean of Y^j ': 

^Kf ] = Pr(llf = 1) 

= ^ Pr(5(A[i,z + m - l],vir) < A:)Pr(B[j, j + m - 1] = w) 

= X! Gk{m,r],c^)Vi{Mv), 

where Cw is the GC-count of w. 

From this we get formulas for the expectation of D2 ■ 



71^ 



Pr(w)Gfc(?n,7?,Cw). 



Remark 3.1. When k = 0, we have Gfc(m, 77, Cw) = Pr(w) and 



^(0)1 
■ij 



This agrees with the formula given in Lippert, Huang and Waterman [10], 
E[Yi,]=pl^\ where P2 = Eae£C^ 

Definition 3.2. For t > 1, let 

Remark 3.3. Note that pt = E[{^xY~^]- Hence, by the Cauchy-Schwarz 
inequality, 

P3 >pI, 

where equality holds if and only if the distribution is uniform: = 1/|£|. 



6 C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON 

3.1. Estimates. The purpose of these estimates is to explain the asymp- 
totic behavior of -Dg*^^ , rather than to provide a computational tool. Hence, 
these estimates are by no means optimal. 

First we estimate the function gi^(^Tn,Tj, c-^) . Without loss of generality, 
assume > 0. From Definition 2.2, we have upper bounds: 



Vk{i,ri,c) < 



Ukim,r],c) < —— J2 



3 + r?\^ 
1 — r] 

"J + ^ \ I ^\ I iTi — c \ ( ^ + ( m 



I — rj J \i J \m — k — i J \1 — i] J \ k 



Remembering that h{m,r],c-w) = Pr(w) and using similar estimates for the 
lower bound we get 

i, h;-— <5'fc("i,r/,Cw) <Pr(w) ' 



K J \1 +riJ \ K J \l — T] 

and 

Pr(w)E(7)(^)'<G.K,,c.) 

(1) 

< Pr(w) 2^ ' 

r=0 



1 — 7] 



Hence, for E[Y^f^] =J2^|=oy^PT{w)Gk{m,r],c^), we have 
Finally, since E[d!^^] =J2fj^^E[Y^J'^], we have 

Remark 3.4. For k = (the exact matches case), (3) gives E[d!^^] = 
n^p™, which agrees with the expectation computed in [10]. 

Remark 3.5. When r] = (the uniform case), P2 = \ and the upper and 
lower bounds in (3) are equal. Hence, 
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k 

r=0 
k 



3 \ ^' / 1 ^ Tn—r 

r I V 4 j U, 



r=0 

4. The variance. In this section we give a lower and upper bound for 

(k) 

the variance oi D2 ■ The lower bound is used later to prove asymptotic 

(k) 

normality of D2 ■ The upper bound is not optimal, but is comparable with 
that given in [10] for the A; = case. We start this section by stating the 
main results in Propositions 4.1 and 4.2. We then prove several technical 
lemmas and finish with the proofs of Propositions 4.1 and 4.2. 



Proposition 4.1. Var(L»^ ^ >^ = 

l{n, m, k), where 

2k 



■ n 



(2n-4m + 2)(m-l)A;2^ ^ ^r'"^2m^P3 



l + V 



Proposition 4.2. 



Var(D^ < ^ (2ra - 4m + 2)m^''' i^- 
(4) -n2(2n-4m + 2 



.Y3 + r/ 



2k 



2pi 



-7] 



2P3 



1 -P3 



P3 



P2 



To compute the variance oi D2 = J2{i,j)GiYij ; we need to compute the 
covariances CoY{Yff\Y},'^}).Fov this, we use techniques from [17]. To shorten 
the indices' notation, let u = and v = 

In the following definition we use notation and terminology from [17], 
Chapter 11. 

Definition 4.3. Let Ju = {v = -.{i' — i\ <m or \j' -j\ < m}. Then 
Ju is the dependency neighborhood of Yu in the sense that if v ^ Ju, then 

(k) (k) 

Y^' and Y^'^' are independent. The dependency neighborhood can be de- 
composed into two parts, accordion and crabgrass, Ju = where 

Ju={v= eJu:\i'-i\<m and \j' -j\<m} and Jl = Ju\ 
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Let -u G /. When v i Ju, Cov{Y^ \Yr') = 0. To estimate Cov(yr\yrO 
when V & Ju, we look at the two cases: v £ (crabgrass) and v £ (ac- 
cordion). We will see that crabgrasses contribute the dominant term of 

Var(£»f ^) in the cases we are interested in, that is, for m = O(logn). Hence 

for accordions we only give a crude approximation of Cov{Yu^\Yv''^). We 
start by proving the following positivity lemma. 

Lemma 4.4. (i) For v G J^, Y^''^ and Y^''^ are nonnegatively correlated. 
That is, 



(ii) For V in the main diagonal of J^, Cow{YT,Yr')>Q. 

Proof. We will use the following notation. For r > define 

Y^p {r) = the fc-neighbor match indicator between two r- words at (i, j). 
\i[r) = 5[A[i,i + r- 1], j + r - 1]) 

= number of mismatches in an r-word match at {i,j). 
^2('") = (^(^[^' + m — r,i' + m — V\,B[j' + m — r,j' + m — 1]) 
= number of mismatches in an r-word match 
at {i' + m — r,j' + m — r). 

To prove part (i), let u= {i,j), v = {i' G J^. Write t = i' — i and s = j' — j. 
By symmetry, we may assume v is in the first quadrant of J^, that is, 
< t < m — 1, and m< s. We have 



Cov(yW,yW)>o. 



Pr(yW = i,yi'=) = i) 



J2 Pr(Ai(t) = Zi)Pr(A2(t) = /2) 



ll,l2=0 



(5) 



xPr(y(f;|^,))(m-t) = i,y( 



{i'd') 



{rn-t) = 1) 



^ Pr(Ai(t) = /i)Pr(A2(t)=/2) 



;i,«2=o 



X ^ Pr(w)Gfc_i,(m-t,?7,Cw)Gfc_i2("^-*'^.Cw) 



-wG£'"-* 
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J2 Pr(Ai(t) = li)Gk^i, (m - t, rj, Cw) 



L h 



For wG/:"^-*, let 

(6) (w) = E Pr(A(t) = (m - t, r?, Cw) , 

I 

where A(t) is the distance between two random t- words. Then (5) says that 

i?[yWyW] = i<;[(/,(M^))2]. 

Similarly we get 

E[Y:^'^^]E[Y,^'^] = E[ftiW)f. 

Hence, 

(7) Cov(yj'=), y«) = VariftiW)) > 0. 

For part (ii), let u = and v = G J^'s main diagonal, that is, 

V = {i + t,j + 1) with |t| < m — 1. By symmetry, we may assume <t. As 
before, let Ai(t) be the number of mismatches in a t-word match at 
and let A2{t) be the number of mismatches at (i + m,j + m). Then 

£;[yWyW]= ^ Pr(Ai(t) =/i)Pr(A2(t) = ^2) 

X Pr(yJ'=-'i)(m - t) = l,yj'=-'2)(m - t) = 1) 

t 

= E Pr(Ai(t)=/i)Pr(A2(t) = /2) 

^1,^2=0 

X Pr(yJ"^^"^*^-^i'*^-'2})(^_i) = 1) 

and 



E[yP]E[Y^^^]= E Pr(Ai(t) = /i)Pr(A2(t) = /2) 

/l,«2=0 



X Pr(yJ'=-'i)(7n -t) = l)Pr(yJ'^-'2)(m - t) = 1). 



Since 



p^(yJmin{fc-ii,fc-«2})(^ - t) = 1) 

> Pr(yJ'=-'i)(m - t) = 1) Pr(yJ^-'2)(,„ -t) = l), 
we have that Cov(yi''^ , yj''^ ) > 0. □ 
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Remark 4.5. From (7) we have that for v G J^, Cov(yi''\yi^^) = 
Var(/t(VF)) for appropriate t. When computing /t(w) in (6), it is worth 
noting the following: 

1. Pr(A(t)=0 = ExG£'Pr(x)gKi,r/,Cx)=Exe/:'Pr(x)2ni(t,r?,Cx).R:om Sec- 
tion 3.1 we also have 



Hence 



2. For k — I >m — t and w G /Z"'"*, Gk~i{m — t, r], Cw) = 1- 

Remark 4.6. When k = (exact matches case), and v € with t as 
above, we get: 
For wG/:™-*, 

/t (w) = Pr(A(t) = 0)Go(m - t, r], Cw) = pI Pr(w). 
Since E[Fv{W)] = Ewe£--' PKw)^ = p™"* and E[Pr(VF)2] = 
Ewe£'"-* Pr(w)^ =^3*"*' we have that 

Cov(yW,yW) = Var(A(VF)) =prVar(Pr(w)) =pr(Pr* "P?""*^)- 
This agrees with the computations in [17], Section 11.5.2. 

Remark 4.7. When rj = (uniform case), /t(w) does not depend on w 
and hence Var(/f(l^)) = 0. Therefore, in this case, fort;G J^,Cov(yi'=\yi'=)): 
0. 

Next we look at the following special crabgrass case. 

Lemma 4.8. Let u = and v = {i' G with t = \i' — i\ = m — 1 
or (by symmetry) \j' — j\ = m — 1. Then 

Cov(yW,yW) = [Pr(A(m - 1) = k)]\p,-pl), 

where A(m — 1) is the distance between two random {m — 1) -words. 
Hence, by Remark 4-5, 



m-l\ fS-r]\'']^ 

P2 



k J \l+r] 



r(|-i)<Cov(yi'=),yW) 



< 



m — 1\ f 3 + r] 



k J \l-r] 



.1"(|-). 
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Proof. By (7), 

Cov(yW,yJ^'))=Var(/i(Ty)). 
Let w £ C. Then, using Remark 4.5, 

/^(w) = ^ Pr(A(m - 1) = l)Gk^i{l, ri, c^) 

1=0 

k-l 

= Pr(A(m - 1) = A;)Go(l, V, Cw) + Pr(A(m -l) = l)-l 

1=0 

k-l 

= Pr(A(m - 1) = A;)^w + Pi'(A(m - 1) = 0- 

/=o 

Note that Pr(A(m - 1) = k) and J2'i=o Pr(A(m -l) = l) do not depend on 
w. Hence, 

Var(/t(VF)) = Var ^Pr(A(?n - 1) = k)Cw + J2 ^^(^("1 - 1) = 0^ 

= [Pr(A(m - 1) = k)fYaT{^w)- 
As noted before (Remark 4.6, with m — t = I), Var(^vy) =P3 ~ P2- ^ 

For the accordion case, we use the fohowing crude estimate. 

Lemma 4.9. For u, v G /, Cov(yi^\ yj''^) = 0{p^m''). 

Proof. 



(8) 



|Cov(yj'=),yW)| < Vvar(yi'))Var(yi'^)) 

= Var(yW)<i?[(yW)^]=i?[yi^)] 



< 



r=0 



f?Ec;j(^^) fro- (2 



4.1. Proof of Proposition 4.1. We now prove the lower bound formula 
for \ax{Df'^) stated in Proposition 4.1. 
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Proof of Proposition 4.1. First we split the variance into the con- 
tributions of crabgrasses and accordions: 

varrf))= Y: Cov(yW,yW) 

= E E Cov(yW,yW) + 5: y: Cov(yW,yj'^)). 

Next we look at the crabgrasses: 

> E E 

U=(i,j)€l v={i' ,j')(iJ^ 

\i' —i\=m—l or \j' —j\=m—l 

> E E 

u=(i,j)ei «=(i'j')eJS 

\i' —i\=m—l or \j'—j\=m—l 



Cov(yi'=), yj'^)) by Lemma 4.4 



1 + r/ 



P2 



by Lemma 4.8 



(2n -4m + 2) 



k 



l + V 



2k 



\P2 



Finally we consider the contribution of the accordions to the variance: 



E E Cov(yW,yi'^)) 



<EE lcov(yw,yi^))l 



^ E E ofe 



by Lemma 4.9 



Then 

Var(D5'=)) = E E Cov(yW,yJ'=)) +0(n2m'=+2p™) 



>2n^ (2n-4m + 2) 



m- 1\2 /3-77\2^ 2mYP3 



1 + 7? 



P2 



□ 



4.2. Proof of Proposition 4.2. The first two terms in (4) come from crab- 
grasses. Let u = {i,j) G I, V = {i',j') € with i' = i + t, <t <m — I. We 
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need to bound Cov(yi''\ fJ''^) = Var(/t(P^)): 



E[ft{Wf]= P^(^) 
< J2 Pr(w)= 



'm—l 



^FiiAit) = l)Gk^iim-t,ri,c^) 

=0 

71—1 k—l 



=0 



r=0 



m — t\ fS + rj 
r J [T~^ 



by (1) 



(m-t)'' 
<P3 



{m-ty 



'm—l "1 2 

EPr(A(t)=/) 



E Pr(w)= 



/3 + 7?V 



fel 2 



and similarly, 



m—l 



E Pr(w) E Pr(A(i) = ^^^-/(m - t, r?, Cw) 



> 



> 



m — l fc— i 

5: Pr(w)2^Pr(A(i)=/)E 

n 2 



m-t\ f^-ri 
r ) [l + r] 



m — l 



Y: Pr(w)2^Pr(A(t)=0 
Lwe£™-t /=0 
2 



E 



m-ti2 



Hence, 



\1 — 7] J 

Summing up over all n's and f 's and using 



J2 E = ^^(2^ - 4m + 2) 



2g 



1 -g" 
1-Q 
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2k 



with q = P3 and p^j respectively, yields 



2k 



3 + 7] 

1 — r] 



2P3 



1 -P3 



P3 



-n^(2n-4m + 2) 
The last term in (4) comes from accordions: 



^ ^ Cov(yW,yJ'=))<5: 5: |Cov(yW,yW)| 



■ n 



'(2m-l)2p-^ 



r=0 



3 + 7? 

1 — r] 

3 + r? 
1 — r] 



from (8) 



5. Asymptotic behavior. We will need the following central limit theo- 
rem of Janson [8] for certain sums of dependent random variables. To state 
it, we first recall the definition of dependency graph. 

Definition 5.1. A graph T is a dependency graph for a family of ran- 
dom variables if the following holds: 

1. There is a one-to-one correspondence between the random variables and 
the vertices of the graph. 

2. If Vi and V2 are two disjoint sets of vertices of F such that there is no 
edge (fi,f2) in F with vi G Vi and V2 € V2, then the corresponding sets 
of random variables are independent. 

Also recall that the maximal degree of a graph is the maximal number of 
edges attached to a single vertex. 

Theorem 5.2 ([8], Theorem 2). Suppose that for each n, {Wni}f=i is 
a family of hounded random variables; \Wni\ < C„ almost surely. Suppose 
further that F„ is a dependency graph for this family and let M„ he the 
maximal degree of F„ [if F„ has no edges, set Mn = !)• Let Sn = Y^f=i Wni 
and = Var(S'„). If there exists an integer t such that 



(9) 
then 



{Nn/Mnf''MnCn/cTn ^0 aS 



n—>-oo, 



{Sn- E{Sn))/ an ^M{0,1) 



as n — oo. 
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Next we state and prove our main theorem. 



Theorem 5.3. Assume that the four letters of the alphabet C are not 
uniformly distributed. That is, the perturbation parameter r] is not zero. Let 

fin = E[d!^^] and Un = V^Varpf ^). 

For m = a log^^^^^ i^) + C with < a < 1/2 and C a constant, and fixed k 
such that < k < m, 



D 



(k) 



M{0, 1) as oo. 



Proof. We apply Theorem 5.2 to the match indicator random variables 

(k) 9 

y> ^ In this case, the dependency graph has n vertices and edges may be 
defined by connecting the vertex {i,j) with (i', j') if |z' — i| < m or \j' — j\ < 
m. Hence, in the notation of Theorem 5.2, A'^^ = n^; C„ = 1; the maximal 
degree of r„ is the size of a dependency neighborhood: 



\Ju\ = (2m - l)(2n - 2m + 1); 



(fc) 



and Sn = D.^ 

Let m = a\ogiip^{n) + C with < a (and k fixed). Then for a < 1, the 
lower bound, i, for cr^ in Proposition 4.1, has the property 



(2n -Am + 2) 



7] 



2k 



P2 



2m P3 



= Cm^"^" + 0(n2-"(log(n))'=+2 
~ 7^3-2a since a < 1. 
Therefore, in condition (9) we have 

iNjMn)^/'MnCn 



1 



+ 0(nV=+2p^^ 



where Ci > is a constant 



(nV((2m - l)(2n - 2m + l)))^/*(2m - l)(2n - 2m + 1) 



< 



n 



(2m-l)(2n-2m + l) 



<7n 

1/t 



(2m-l)(2n-2m + l) 



(10) 



n 



(2n - 4m + 2) 



3 — 7] 
l + T] 



2k 



,2m. I P3_ 

pI 



1 



P2 

+ 0(n2m'=+V2") 



l/2s 
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(c.logi/p^(n))i-iAni+V* ^ (alogi/p^(n))i'i/* 

~ [n3-2«]l/2 „l/2-l/t-a 

^0 asn^oo, if 1/2 - 1/t - a > 0. 

Thus, for a < 1/2, we can find t large enough such that 1/2 — 1/t — a > 0. 
□ 

In [10], Lippert, Huang and Waterman used a variation on Stein's resuh 
([15], page 110), due to Dembo and Rinott ([7], Theorem 4.2), to prove 

the fohowing result, under the assumption of i.i.d. letters, for the D2 = 
statistic. Let C be an alphabet set of size \C\ > 1 with nonuniform probability 
measure ^. Then for m = alogi/p2{n) + C with < a < 1/6, 

=>J\[0,1) asn^oo. 

Following simulations, it was noted in [10] that the bound 1/6 above is too 
small. Our simulations in Section 6 suggest that the bound should be closer 
to 2. 

By adjusting the proof of Theorem 5.3 to an alphabet set of any size 
\C\ > 1, we can improve the bound on a from 1/6 to 1/2. Thus we have the 
following theorem. 

Theorem 5.4. Let L he an alphabet set of size \C\ > 1 with nonuniform 
probability measure Then for m = alogi/p2{n) + C with <a < 1/2, 

D2{n)- fin d . 

=^Jv[0,l) as n—> 00. 

Proof. From the lower bound for D2 in [10], 



yai:{D2) > n 



(2m - l)(2n - 4m + 2)p^™(p3/p^ - 1) 
+ p^fl±Il^_^2m-l)p- 

V 1 -P2 



~ alogi/p^{n){n^ ^") since (ps/p^ — 1) > by Remark 3.3. 
Hence, in (10) in the proof of Theorem 5.3 we now have 

(iV„./M„)V*M„a 

CTn 

_ (nV((2m - l)(2n - 2m + l)))^/\2m - l){2n - 2m + 1) 
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^2 ^ 1/i 

<(- -— (2m-l)(2n-2m + l) 



(2m-l)(2n-2m + l 
X f n 



(2m - l)(2n - 4m + 2)p^™ ( ^ - 1 

(alogi/p^(n))i-V*ni+V* ^ (alogi/^^ (n)) V^'V* 
[alogi/p^(n)n3-2a]i/2 nVs-i/t-a 

^0 as n ^ cx), if 1/2 - 1/t - a > 0. 
The rest of the proof is the same as the proof of Theorem 5.3. □ 

Remark 5.5. When the underlying distribution is uniform, Yar{D^^) ~ 
■ Hence, this method of proof fails to show normal asymptotic behavior 
in the uniform case. In fact, for A; = 0, the distribution of 02'^ is not normal 
when |£| = 2, m = 1 and n — > 00 (see [10]). 



6. Numerical simulations. We have carried out numerical simulations 
of pairs of randomly generated sequences of length n = 100 x 2*, i = 0, . . . , 4 



with the nonuniform letter distribution iA = = \-, ic = = h- The statis- 



tic D2 was calculated for each sequence pair using an algorithm based on 
that given in [12]. Kolmogorov-Smirnov values [4] for the standardized 
statistic {D^"^ — fin)/crn compared with the standard normal distribution 
for sample sizes of 2500 sequence pairs are shown in Figure 1. Samples of 

(k) 

D2 which are a close approximation to the normal distribution will have 
p- values distributed uniformly on the interval [0, 1], whereas samples which 
are a poor approximation to the normal distribution will have small p- values. 

Entries in the tables in Figure 1 are shaded to indicate proximity of sam- 
ples to the standard normal distribution, with lighter shades signaling a 
better agreement. The white diagonal line in each table is m = 21ogi/p2 ^ ~^ 
const. The numerical evidence strongly suggests that 

— =^N{0,1) as n — i- 00, 

where the limit is taken along any line m = alogi^p^{n) + C with < a < 2 
for fixed k and C. 

7. Conclusions. We have studied the 02^^ statistic as suggested in [12], 
and defined it as the number of m-word matches with up to k mismatches 
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Fig. 1. Kolmogorov-Smirnov p-values for nonuniform d'"2^ with letter distribution 
=Ct = |, = Cg = I compared with a normal distribution. The white diagonal lines 
are m = alog^/p^n+ const., witha = 2 and '^/P2 = l/J2ae{A,c,G,T}^"-^T'- 
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between two sequences of length n, for strand-symmetric Bernoulli texts 
with a nonuniform letter distribution. We have extended methods applied 
in [10] to give upper and lower bounds on the variance, and have also studied 
the asymptotic behavior as the sequence length n and word length m tend 
to infinity for fixed k. 

(k) 

We have proved that the asymptotic distributional behavior of D2 is 
normal as n ^ 00 for a pair of strand-symmetric Bernoulli texts provided 
the limit is taken along any line m = alogi/p^ n C with < a < ^. For 
/c = this result is also shown to hold for a pair of Bernoulli texts with any 
nonuniform letter distribution. This improves the previous bound for the 
A; = case of a < g given in [10]. 

We have also carried out numerical simulations of strand-symmetric texts 
with letter distribution S,a = = \ , ic = = \ - These simulations strongly 
suggest that the optimum restriction on asymptotic normal behavior may be 
as high as a < 2. This is consistent with simulations in [10] and their result 
that the asymptotic distributional behavior of D^^ is a compound Poisson 
distribution for a>2. 

List of notations. 

D2: The number of m-letter word matches between two given sequences. 
D2 ■ The number of m-letter word matches with up to k mismatches (0 < 

k <m) between two given sequences. 
gk{m,r],c): The perturbed binomial distribution (Definition 2.2). Given a 

strand-symmetric Bernoulli text of length m and perturbation parameter 

r], and an m-word with GC-content c, this is the probability distribution 

of the number of character mismatches between the text and the m-word. 
Gfc(m, rj, c): The cumulative distribution function of the perturbed binomial 

distribution gk{m,r],c). 
hk{Tn,rj^c): For a given m-word with GC-content c, the probability that the 

word occurs at a given site in a strand-symmetric Bernoulli string with 

perturbation parameter 77. (See Definition 2.2.) 

Ju'- The dependency neighborhood of Yu^^ , where u= that is, the word 

locations v = {i' such that either the word at i' overlaps the word at i 

in the first sequence, or the word at j' overlaps the word at j in the first 

sequence, or both. (See Definition 4.3.) 
J^: The accordion, that is, the subset of Ju such that both the word at i' 

overlaps the word at i in the first sequence, and the word at j' overlaps 

the word at j in the first sequence. 
J^: The crabgrass, that is, the subset of Ju such that either the word at i' 

overlaps the word at i in the first sequence, or the word at j' overlaps the 

word at j in the first sequence, but not both. 



20 



C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON 



k: The number of mismatches. 
m: The word length. 

n: The length of each of the two sequences. 

n: n — m + 1, the number of possible locations of an m-word in a sequence 
of length n. 

Pt- "llaec^a^ where the sum is taken over the alphabet C. 
Mfc(m, 77, c), ry, c): Functions occurring in the definition of gk{m,r],c). 
(See Definition 2.2.) 

Yu''^ or y}-'^ : The indicator random variable for the event that the ?Ti-word 
starting at position i in the first sequence has no more than k mismatches 
with the m-word starting at position j in the second sequence. We use 
the convention u= {i,j),v = throughout. 

77: The perturbation parameter for a strand-symmetric Bernoulli text. (See 
Section 2.) 

S^a- The probability of finding the letter a at a given location in a strand- 
symmetric Bernoulli string. 
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